#pip install autogluon
This material is from Professor 최규빈's Special Topics in Big Data Analysis course at Jeonbuk National University, 2nd semester of 2023.
02wk-007: Titanic, Autogluon (Fsize,Drop)
최규빈
2023-09-12
1. Lecture video
https://youtu.be/playlist?list=PLQqh36zP38-wiSZXhNO5rMncu6h42SNDi&si=fmqkO_EQek1SgbNQ
2. Import
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv
from autogluon.tabular import TabularDataset, TabularPredictor
3. Analysis procedure
A. Data
- Analogy: this step is like receiving the exam problems.
tr = TabularDataset("~/Desktop/titanic/train.csv")
tst = TabularDataset("~/Desktop/titanic/test.csv")
- Feature engineering: combine SibSp and Parch into a single feature Fsize, then drop the two original columns.
_tr = tr.eval('Fsize = SibSp + Parch').drop(['SibSp','Parch'],axis=1)
_tst = tst.eval('Fsize = SibSp + Parch').drop(['SibSp','Parch'],axis=1)
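The feature-engineering step can be tried on a small table first. The sketch below uses a hypothetical three-row DataFrame (the name `toy` and its values are made up for illustration, not taken from the Titanic files) to show what `eval` followed by `drop` produces.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the Titanic set.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
})

# eval() with an assignment returns a new DataFrame that has the extra
# column Fsize; drop() then removes the two columns it was built from.
engineered = toy.eval("Fsize = SibSp + Parch").drop(["SibSp", "Parch"], axis=1)

print(engineered)
```

Because neither call is in-place, `toy` itself is unchanged, which is why the lecture code stores the result under new names (`_tr`, `_tst`).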
B. Creating the Predictor
- Analogy: this step is like creating the student who will solve the problems.
predictr = TabularPredictor("Survived")
No path specified. Models will be saved in: "AutogluonModels/ag-20230917_141245/"
C. Fitting (fit)
- Analogy: this step is like the student studying.
- Training
# Give the problems (_tr) to the student (predictr) and have it study (predictr.fit())
predictr.fit(_tr)
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230917_141245/"
AutoGluon Version: 0.8.2
Python Version: 3.8.18
Operating System: Linux
Platform Machine: x86_64
Platform Version: #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail: 775.53 GB / 982.82 GB (78.9%)
Train Data Rows: 891
Train Data Columns: 10
Label Column: Survived
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [0, 1]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 38785.27 MB
Train Data (Original) Memory Usage: 0.31 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting TextSpecialFeatureGenerator...
Fitting BinnedFeatureGenerator...
Fitting DropDuplicatesFeatureGenerator...
Fitting TextNgramFeatureGenerator...
Fitting CountVectorizer for text features: ['Name']
CountVectorizer fit with vocabulary size = 8
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 2 | ['Age', 'Fare']
('int', []) : 3 | ['PassengerId', 'Pclass', 'Fsize']
('object', []) : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked']
('object', ['text']) : 1 | ['Name']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 3 | ['Ticket', 'Cabin', 'Embarked']
('float', []) : 2 | ['Age', 'Fare']
('int', []) : 3 | ['PassengerId', 'Pclass', 'Fsize']
('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...]
('int', ['bool']) : 1 | ['Sex']
('int', ['text_ngram']) : 9 | ['__nlp__.henry', '__nlp__.john', '__nlp__.master', '__nlp__.miss', '__nlp__.mr', ...]
0.2s = Fit runtime
10 features in original data used to generate 27 features in processed data.
Train Data (Processed) Memory Usage: 0.07 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.17s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 712, Val Rows: 179
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/torch/cuda/__init__.py:497: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
Exception ignored on calling ctypes callback function: <function _ThreadpoolInfo._find_modules_with_dl_iterate_phdr.<locals>.match_module_callback at 0x7f2dd48085e0>
Traceback (most recent call last):
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 400, in match_module_callback
self._make_module_from_path(filepath)
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 515, in _make_module_from_path
module = module_class(filepath, prefix, user_api, internal_api)
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 606, in __init__
self.version = self.get_version()
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 646, in get_version
config = get_config().split()
AttributeError: 'NoneType' object has no attribute 'split'
0.6536 = Validation score (accuracy)
0.57s = Training runtime
0.03s = Validation runtime
Fitting model: KNeighborsDist ...
Exception ignored on calling ctypes callback function: <function _ThreadpoolInfo._find_modules_with_dl_iterate_phdr.<locals>.match_module_callback at 0x7f2dd230f0d0>
Traceback (most recent call last):
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 400, in match_module_callback
self._make_module_from_path(filepath)
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 515, in _make_module_from_path
module = module_class(filepath, prefix, user_api, internal_api)
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 606, in __init__
self.version = self.get_version()
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 646, in get_version
config = get_config().split()
AttributeError: 'NoneType' object has no attribute 'split'
0.6536 = Validation score (accuracy)
0.02s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBMXT ...
0.8101 = Validation score (accuracy)
0.21s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBM ...
0.8268 = Validation score (accuracy)
0.21s = Training runtime
0.0s = Validation runtime
Fitting model: RandomForestGini ...
0.8156 = Validation score (accuracy)
0.29s = Training runtime
0.03s = Validation runtime
Fitting model: RandomForestEntr ...
0.8212 = Validation score (accuracy)
0.27s = Training runtime
0.02s = Validation runtime
Fitting model: CatBoost ...
0.8268 = Validation score (accuracy)
0.56s = Training runtime
0.0s = Validation runtime
Fitting model: ExtraTreesGini ...
0.8045 = Validation score (accuracy)
0.27s = Training runtime
0.02s = Validation runtime
Fitting model: ExtraTreesEntr ...
0.7989 = Validation score (accuracy)
0.29s = Training runtime
0.03s = Validation runtime
Fitting model: NeuralNetFastAI ...
No improvement since epoch 9: early stopping
0.8268 = Validation score (accuracy)
0.62s = Training runtime
0.01s = Validation runtime
Fitting model: XGBoost ...
0.8212 = Validation score (accuracy)
0.2s = Training runtime
0.0s = Validation runtime
Fitting model: NeuralNetTorch ...
0.838 = Validation score (accuracy)
1.67s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBMLarge ...
0.8268 = Validation score (accuracy)
0.34s = Training runtime
0.0s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
0.8603 = Validation score (accuracy)
0.33s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 6.31s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230917_141245/")
<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7f2da6b8c490>
- Check the leaderboard (grading the mock exam)
predictr.leaderboard()
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L2 0.860335 0.039553 2.707510 0.000509 0.333600 2 True 14
1 NeuralNetTorch 0.837989 0.007419 1.666336 0.007419 1.666336 1 True 12
2 LightGBMLarge 0.826816 0.002828 0.336571 0.002828 0.336571 1 True 13
3 LightGBM 0.826816 0.003033 0.210566 0.003033 0.210566 1 True 4
4 CatBoost 0.826816 0.003410 0.561348 0.003410 0.561348 1 True 7
5 NeuralNetFastAI 0.826816 0.006794 0.623764 0.006794 0.623764 1 True 10
6 XGBoost 0.821229 0.004909 0.202243 0.004909 0.202243 1 True 11
7 RandomForestEntr 0.821229 0.022999 0.271596 0.022999 0.271596 1 True 6
8 RandomForestGini 0.815642 0.025080 0.294840 0.025080 0.294840 1 True 5
9 LightGBMXT 0.810056 0.002956 0.207451 0.002956 0.207451 1 True 3
10 ExtraTreesGini 0.804469 0.022679 0.268605 0.022679 0.268605 1 True 8
11 ExtraTreesEntr 0.798883 0.025636 0.289557 0.025636 0.289557 1 True 9
12 KNeighborsDist 0.653631 0.013097 0.015496 0.013097 0.015496 1 True 2
13 KNeighborsUnif 0.653631 0.033604 0.567649 0.033604 0.567649 1 True 1
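- Meaning of the validation set: according to the training log, AutoGluon automatically held out 20% of the training data (holdout_frac=0.2, 179 of the 891 rows) and trained on the remaining 712 rows. The score_val column in the leaderboard is each model's accuracy on those 179 held-out rows, much like a mock exam on problems the student did not study.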
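The numbers in the training log and the leaderboard can be checked with simple arithmetic. The sketch below assumes the validation size is rounded up to a whole row, which reproduces the 712/179 split in the log. The count of 154 correct answers is inferred from the reported score and is not stated anywhere in the output.

```python
import math

n_rows = 891          # "Train Data Rows" reported in the log
holdout_frac = 0.2    # holdout_frac reported in the log

# Assumption: the validation size is rounded up to a whole row.
n_val = math.ceil(n_rows * holdout_frac)
n_train = n_rows - n_val
print(n_train, n_val)  # 712 179

# Accuracy on the validation set = correct predictions / validation rows.
# 154 correct out of 179 (inferred) matches WeightedEnsemble_L2's score_val.
n_correct = 154
score_val = n_correct / n_val
print(round(score_val, 6))  # 0.860335
```

With only 179 validation rows, one additional correct answer changes the score by about 0.0056, so small gaps between models in the leaderboard amount to a difference of one or two passengers.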
D. Prediction (predict)
- Analogy: this step is like solving problems after studying.
- Solve the training set (predict) \(\to\) check the score
(tr.Survived == predictr.predict(_tr)).mean()
0.9438832772166106
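The score above is an accuracy computed by comparing two Series element by element. A minimal sketch with hypothetical labels and predictions (toy values, not Titanic data) shows how that comparison works:

```python
import pandas as pd

# Hypothetical true labels and predictions, sharing the same index.
y_true = pd.Series([0, 1, 1, 0, 1])
y_pred = pd.Series([0, 1, 0, 0, 1])

# The comparison yields a boolean Series; True counts as 1 and False
# as 0, so the mean is the fraction of correct predictions.
matches = (y_true == y_pred)
accuracy = matches.mean()

print(accuracy)  # 0.8
```

The training-set accuracy (about 0.944) is higher than the validation score of the best model (about 0.860), which is expected because the student is being graded on problems it has already studied.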
- Solve the test set (predict) \(\to\) submit the results to Kaggle to check the score
tst.assign(Survived = predictr.predict(_tst)).loc[:,['PassengerId','Survived']]\
.to_csv("autogluon(Fsize,Drop)_submission.csv",index=False)
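To see what the submission file looks like without running AutoGluon, the same `assign` / `loc` / `to_csv` chain can be applied to a toy table. In this sketch `toy_test` and `toy_pred` are hypothetical stand-ins for `tst` and `predictr.predict(_tst)`, and the CSV is written to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the test set (toy values).
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass": [3, 3, 2],
})
# Hypothetical stand-in for the predictor's output, aligned by index.
toy_pred = pd.Series([0, 1, 0], index=toy_test.index)

# Attach the predictions, then keep only the two columns to submit.
submission = toy_test.assign(Survived=toy_pred).loc[:, ["PassengerId", "Survived"]]

# index=False keeps the row index out of the CSV.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

The resulting text has a `PassengerId,Survived` header followed by one row per passenger, which is the same layout as the `gender_submission.csv` sample listed among the input files.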